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^ 12 Summary 

^ 13 Approximate Bayesian computation (ABC) or likelihood-free inference algorithms are used 

14 find approximations to posterior distributions without making explicit use of the likelihood 

^ 15 function, depending instead on simulation of sample data sets from the model. In this paper we 

16 show that under the assumption of the existence of a uniform additive model error term, ABC 

^^17 algorithms give exact results when sufficient summaries are used. This interpretation allows the 

^ 1^ approximation made in many previous application papers to be understood, and should guide the 

^ i9 choice of metric and tolerance in future work. ABC algorithms can be generalized by replacing 

"j^ 20 the 0-1 cut-off with an acceptance probability that varies with the distance of the simulated data 

"^21 from the observed data. The acceptance density gives the distribution of the error term, enabling 
the uniform enor usually used to be replaced by a general distribution. This generalization can 

I 23 also be applied to approximate Markov chain Monte Carlo algorithms. In light of this work, 

J> 24 ABC algorithms can be seen as calibration techniques for implicit stochastic models, inferring 

25 parameter values in light of the computer model, data, prior beliefs about the parameter values, 

26 and any measurement or model errors. 

• 28 Some key words: Approximate Bayesian computation; calibration; implicit inference; likelihood-free inference; Monte 

29 Carlo. 
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1. Introduction 



k>( 33 Approximate Bayesian computation (ABC) algorithms are a group of methods for perform- 

''^ 34 ing Bayesian inference w ithout the need for expl i cit evaluation of the model likelihood function 



5^ 35 (iBeaumont et all (I2OO2I) : iMarjoram eTa r ('2003'); 'Sisson et all (|2007|) ). The algorithms can be 



36 used with impUcit computer models Jpig gle & Gratton (1984 )) that generate sample data sets 



37 rather than return likelihoods. ABC methods h ave become popular in the biolog i cal sc iences 



38 with apphc ations in genet i cs (se e, for example, ISiegmund et all (120081) : IFoU et all (120081)). epi 

39 demiology ("Blum & Tran' (12008'); 'Tanaka et al. ' ('2006')) and population biology (iRatmann et al.l 

40 12007); Hamilton et al. (2005); Cornuet et al. (2008)) most common. This popularity is primarily 

41 due to the fact that the likelihood function, which can be difficult or impossible to compute for 

42 some models, is not needed in order to do inference. However, despite their popularity little is 

43 known about the quality of the approximation they provide beyond results shown in simulation 

44 studies. 

45 In this paper we give a framework in which the accuracy of ABC methods can be understood. 

46 The notation throughout this paper is as follows. Let 9 denote the vector of unknown model 

47 parameters we wish to infer, and let r]{-) denote the computer model. We assume 7]{-) is stochas- 

48 tic, so that the model repeatedly run at 9 will give a range of possible model outputs, and write 
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X ~ rjiO) to denote that X has the same distribution as the model run at 6. To distinguish the 
model output from the observed data, let D denote the observations. The aim is to calibrate the 
model to the data, in order to learn about the true value of the parameter. The Bayesian approach 
is to find the posterior distribution of 9 given D, given by 

' ^) = n{D) ■ 

Throughout this paper, 7r(-) is used to denote different probability densities, and ir{- \ •) condi- 
tional densities, with the context clear from the arguments. Above, 7r{9) is the prior distribution, 
7r(L' I 9) is the likelihood of the data under the model given parameter 9 (the probability distri- 
bution of r]{9)), 'k{9 I D) is the posterior distribution, and vr(D) is the evidence for the model. 

It is usual in Bayesian inference to find that the normalizing constant vr(D) is intract able, 
and a wide range of Monte Carlo techniques have been developed to deal with this case (iLiu 
Doubly-intractable distributions are distributions which have a likelihood function 
■k{D I 9) = q{D I 9)/c{9) which is known only up to a normalizing constant, c(9) , which is in- 
tract able. Standard Monte Carlo techniques do not apply to these distributions, and lMurray et al. 



(I2OO6) have developed algorithms which can be used in this case. ABC methods are Monte 
Carlo techniques developed for use with completely-intractable distributions, where the likeli- 
hood function 'k{D \ 9) is not even known up to a normalizing constant. ABC algorithms, some- 
times called likelihood-free algorithms, enable inference using only simulations generated from 
the model, and do not require any evaluation of the likelihood. The most basic form of the ABC 
algorithm is based on the rejection algorithm, and is as follows: 



Algorithm A: approximate rejection algorithm 

Al Draw 6* ~ 7r(6') 

A2 Simulate X from the model X ~ 'q{9) 
A3 Accept 9 if D) < 6. 



Here, p{-, •) is a distance measure on the model output space, and 5 is a tolerance determining 
the accuracy of the algorithm. Accepted values of 9 are not from the true posterior distribution, 
but from an approximation to it, written 7r(0 | p{D, X) < d). When (5 = this algorithm is exact 
and gives draws from the posterior distribution it{9 \ D), whereas as 5 ^ 00 the algorithm gives 
draws from the prior. While smaller values of 5 lead to samples which better approximate the 
true posterior, they also lead to lower acceptance rates in step A3 than larger values, and so 
more computation must be done to get a given sample size. Consequently, the tolerance 5 can be 
considered as controlling a trade-off between computability and accuracy. 

Several extensions have been made to the approximate rejection algorithm. If the data are high 
dimensional, then a standard change to the algorithm is to summarize the model output and data, 
using a summary statistic S{-) to project X and D onto a lower dimensional space. Algorithm A 
is then changed so that step A3 reads 

A3' Accept 9 if S{D)) < 5. 

Ideally, S{-) should be chosen to be a sufficient statistic for 9. However, if the likelihood is 
unknown, then sufficient statistics cannot be identified. Summarizing the data and model output 
using a non-sufficient summary adds another layer of approximation on top of that added by the 
use of the distance measure and tolerance, but again, it is not known what effect any given choice 
for S{-) has on the approximation. 



Approximate Bayesian computation (ABC) gives exact results under the assumption of model error?! 
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Beaumont et alj (12002!) contains two innovations; they replace the discrete 0-1 cut-off in step 
A3 with a weighting scheme, using an Epanechnrkov kernel to weight each value of 9 according 
to the value of the metric p{D, X), so that large weights are assigned to values of 9 which pro- 
duce model output close to the measurement and small weights to those values which produce 
output distant from D. They then use the weighted sample [Oi, wi) to train a local-linear regres- 
sion to model the posterior density. ' Blum & FrancoisI (1200 8') have since extended this by using an 
adaptive heteroscedastic model to estimate the posterior from the weighted sample. While it has 
been shown that using a weighting scheme improves the accuracy of the approximate rejection 
algorithm in several situations, it is still unclear what the approximation represents or why any 
given weighting should be preferred over any other. 

In this paper it is shown that the basic approximate rejection algorithm can be interpreted as 
performing exact inferenc e in the presence of unif orm model or measurement error. Similarly, the 
weighting scheme used in Beaumont et al. ( 20021 ) corresponds to an error with an Epanechnrkov 
distribution. In other words, it is shown that ABC gives exact inference for the wrong model, 
and we give a distribution for the model error term for whatever choice of metric and tolerance 
are used. This interpretation allows us to show the effect a given choice of metric, tolerance 
and weighting have had in previous applications, and should provide guidance when choosing 
metrics and weightings in future work. It is also shown that Algorithm A can be generalized 
to give inference under the assumption of a completely flexible form for the model error. We 
discuss how to model the model error, and show how some models can be rewritten to give exact 
inference. 

Finally, ABC has been extended by iMarjoram et al. I (l2003l) fror n the rejectio n algo rithm 
to approximate Mark ov chain Monte Carlo algorithms, and by ISisson et all (120071) and 
Beaumont et al. (l2008 h to approximate sequential Monte Carlo algorithms. We extend the ap- 
proximate Markov chain Monte Carlo algorithm to give inference for a general form of error, 
and suggest methods for calculating Bayes factors and integrals for completely-intractable dis- 
tributions. 



2. Interpreting ABC 

In this section a framework is described which enables the effect a given metric and weighting 
have in ABC algorithms to be understood. This will then allow us to improve the inference by 
carefully choosing a metric and weighting which more closely represents our true beliefs. The 
key idea is to assume that there is a discrepancy between the best possible model prediction 
and the data. This discrepancy represents either measurement error on the data, or model error 
describing our statistical beliefs about where the model is wrong. George Box famously wrote 
that 'all models are wrong', and in order to link models to reality it is necessary to account for 
this model error wh en perforrn i ng inf er ence. In the context of deterministic models , this p ractice 
is well established ('Campbelll (120061) : iGoldstein & Rousieil (l2008l) : iHigdon et all (12008)). and 
should also be undertaken when linking stochastic models to reality, despite the fact that the 
variability in the model can seemingly explain the data as they are. 

The framework in t roduc ed here uses the best input approach, similar to that given in 
Kennedy & Q'HaganI (1200 ih . We assume that the measurement D can be considered as a re- 
alization of the model run at its best input value, 9, plus an independent error term e 



D = ri{e)+e. 



(1) 



The error e might represent measurement error on D, or model error in ?](•), or both, in which 
case we write e = ei + 62. Discussion about the validity of Equation ([T]), and what e represents 
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145 and how to model it are delayed until Section 3, and for the time being we simply consider e to 

146 have density 7re(-). The aim is to describe our posterior beliefs about the best input 9 in light of 

147 the error e, the data D, and prior beliefs about 9. Consider the following algorithm: 
148 

149 

Algorithm B: probabilistic approximate rejection algorithm 

l^l Bl Draw ~ 7r(6') 

J52 B2 Simulate X from the model X ~ r]{9) 

153 B3 Accept 9 with probability IEdItJQ._ 

154 

155 . X , 

Here, c is a constant chosen to guarantee that 7re\D — X) /c defines a probability. For most cases 

we will expect e to have a modal value of 0, and so taking c = Tr^{0) will make the algorithm valid 

and also ensure efficiency by maximizing the acceptance rate. If D and X are both real valued 

arrays of matching dimension, then D — X is simply the arithmetic pairwise difference. How- 

ever, if D and X are not real- valued, for example, D and X could both be gene sequences, then 

D — X represents the difference between the two data sets (we could write D — X = p{D, X)) 

^iid 7re(-) is a distribution on this space of differences. 

The main innovation in this paper is to show that Algorithm B gives exact inference for the 

statistical model described above by Equation ([T|l. This is essentially saying that ABC gives exact 

inference, but for the wrong model. 

166 

167 Theorem 1. Algorithm B gives draws from the posterior distribution 'k{9 \ D) under the 

168 assumption that D = r]{9) + e and e ~ vre(-) independently ofr]{9). 
169 

170 

Y-jy Proof. Let 

^ I 1 if is accepted 

173 ^ = { 

y otherwise. 

175 We then find that 

176 r 

177 pr(/ = l\9)= / pr(/ = 1 | 7/(6*) = x, 9)-k{x \ 9)dx 

179 = / 7r(x I 9)ax. 

180 J ^ 

181 This gives that the distribution of accepted values of is 



190 
191 
192 



182 

183 7r(0|/=l) 



t:{9) J tt,{D-x)tt{x \ 9)dx 



V I ; Jt,(^0>) Jt,^^D _x)7r{x\e')dxd9'' 

185 To complete the proof we must find the posterior distribution of the best model input given 

186 the data D under the assumption of model error. Note that tt{D \ r]{6) = x) = TT^iD — x) which 

187 implies that the likelihood of 6 is 
188 

189 7r(L> I ^) = / 7r(L> I r]{e) = x,e)-K{x \ 9)dx 

IQO J 

Tr^{D — x)'ir{x \ 9)dx. 



Approximate Bayesian computation (ABC) gives exact results under the assumption of model errorS 

193 Consequently, the posterior distribution of 9 is 

194 

195 I ^) _ T^iO) f - x)7:ix \ 9)dx 



196 J tt{9) f Tr,{D - x)tt{x \ 9)dxd9 

1 Q7 

which matches the distribution of accepted values from Algorithm B. □ 

198 



To illustrate the algorit hm, we consider the toy example used in lSisson et al. and again 



in Beaumont et al. (l2008h where analytic expressions can be calculated for the approximations. 

Example 1 . Assume the model is a mixture of two normal distributions with a uniform prior 
for the mean: 

204 111 

205 m ~ -M{9, 1) + -M{9, — ), 9 ~ Z^[-10, 10]. 
206 

2Q7 Further assume that we observe D = 0, but that there is measurement error e on this data. If 

2Q§ e ~ 5, S], which is the assumption made when using Algorithm A with p{x, 0) = \x\, then it 

2Q9 is possible to show that the approximation is 

210 TT{9\er^ U[-6, 6],D = 0) (x^{6 -9) - ^{-6 -9) + $(10(e - 9)) - $(-10(e + 9)) 

211 

2^2 for e [—10, 10], where ^{■) is the standa rd Gaussian dist ribution function. Note that this is 



2^3 slightly different to the distribution given in lBeaumont et al.l (12008 ). An alternative to assuming 

2^4 uniform error, is to suppose that the error has a normal distribution e ~ M{0, <5^/3). It can then 

2^5 be shown that the posterior distribution of 9 is 

216 52 ^ 52 1 1 52 

217 ^(^ I e ~ -^(0' y ^ = 0) oc -m 0, 1 + y ) + ^^J^ + j) 
218 
219 
220 



truncated onto [—10, 10]. This is the approximation found when using Algorithm B with a Gaus- 
sian acceptance kernel, where (?!>(•; p, cj^) is the probability density function of a Gaussian dis- 

22^ tribution with mean and variance cj^. The value of the variance, (5^/3, is chosen to be equal 

222 to the variance of a U[—S, 6] random variable. For large values of the tolerance 6, the difference 

223 between the two approximations can be significant (see Figure [Hi, but in the limit as S tends to 

224 zero, the two approximations will be the same, corresponding to zero error. 

225 
226 

227 3. Model discrepancy 

228 The interpretation of ABC given by Theorem 1 allows us to revisit previous analyses done 

229 using the ABC algorithm, and to understand the approximation in the posterior in terms of the 

230 distribution implicitly assumed for the error term. If the approximate rejection algorithm (Algo- 

231 rithm A) was used to do the analysis, we can see that this is equivalent to using the acceptance 

232 probability 
233 

234 ^,(r) _ /lif p(r) < 6 

235 c 1 otherwise 
236 
237 
238 
239 

240 er^U{x : p{x,D) <6}. 



where r is the distance between the simulated and observed data. This says that Algorithm A 
gives exact inference for the model which assumes a uniform measurement error on the region 
defined by the 0-1 cut-off, i.e.. 
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Fig. 1. The posterior distributions found when using Algo- 
rithm A (solid line) and Algorithm B (dashed line) with a 
Gaussian acceptance kernel. The left plot is for 5 = 1 and 
the right plot for 5 = 0.1. 



If •) is a Euclidean metric, p{D, x) = {x — D) 



[X 



D), this is equivalent to assuming uni- 



form measurement error on a ba ll of radius 5 abq i it D. 

The weighting scheme used in Beaumont et al.l ( 2002h . and since used in numerous application 
papers, used an Epanechnikov kernel -fT^dlx — to weight each value of 6. The form of this 
density is 



Ks{r) 



_3_ 



1 



<<5- 



(2) 



The kernel gives the probability density function of the error term, and thus analysis done using 
an Epanechnikov kernel give results which assume error with this distribution. In most situations, 
it is likely to be a poor choice for a model of the measurement error, because the tails of the 
distribution are short, with zero mass outside of the interval [—5, 5\. 

There are two ways we can choose to view the error term; either as measurement error or model 
error. Interpreting e to represent measurement error is relatively straight forward, as scientists 
usually hold beliefs about the distribution and magnitude of measurement error on their data. 
For most problems, assumptions of uniform measurement error will be inappropriate, and so 
using Algorithm A with a 0-1 cut-off will be inappropriate. But we have shown how to replace 
this uniform assumption with a distribution which more closely represents the beliefs of the 
scientist. Although the distribution of the measurement error will often be completely specified 
by the scientist, for example zero-mean Gaussian error with known variance, it is possible to 
include unknown parameters for the distribution of e in 6* and infer them along with the model 
parameters. Care needs to be taken to choose the constant c so that the acceptance rate in step B3 
is less than one for all values of the parameter, but other than this it is in theory simple to infer 
error parameters along with the model parameters. So for example, if e ~ A/^(0, ci^), where is 
unknown, we could include in 6. 
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289 Some models have sampling or measurement error built into the computer code so that the 

290 model output includes a realization of this noise. Rather than coding the noise process into the 

29 1 model, it will sometimes be possible to rewrite the model so that it outputs the latent underlying 

292 signal. If the likelihood of the data given the latent signal is computable (as it often is), then it 

293 may be possible to analytically account for the noise with the acceptance probability vre(-). ABC 

294 methods have proven most popular in fields such as genetics, epidemiology, and population bi- 

295 ology, where a common occurrence is to have data generated by sampling a hidden underlying 

296 tree structure. In many cases, it is the partially observed tree structure which causes the likeli- 

297 hood to be intractable, and given the underlying tree the sampling process will have a known 

298 distribution. If this is the case (and if computational constraints allow), we can use the proba- 

299 bilistic ABC algorithm to do the sampling to give exact inference without any assumption of 

300 model error. Note that if the sampling process gives continuous data, then exact inference using 

301 the rejection algorithm would not be possible, and so this approach has the potential to give a 

302 significant improvement over current methods. 



Example 2. To illustrate the idea of rewriting t he model in order to do analytic sampling. 
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316 

itly and so we cannot use a likelihood based approach to find the posterior dist r ibution of the 
unknown parameter 6 = (A, r, a). The ABC approach used in Plagnol & Tavara (120041) was to 



we describe a version of the problem considered in iPlagnoI & Tavara (120041) . Their aim was to 
use the primate fossil record to date the divergence time of the primates. They used an inho- 
mogeneous branching process to model speciation, with trees rooted at time t = t, and simu- 
lated forwards in time to time t = 0, so that the depth of the tree, r, represents the divergence 
time of interest. The branching process is par ametrized by A, which can either be estimated and 
fixed, or treated as unknown and given a prior distribution. Time is split into geologic epochs 
T < tfc < ■ ■ • < ti < 0, and the data consist of counts of the number of primate species that have 
been found in each epoch of the fossil record, D = {Di, . . . , Dk). Fossil finds are modelled by 
a discrete marking process on the tree, with each species having equal probability a of being 
preserved as a fossil in the record. If we let Ni be the cumulative number of branches that ex- 
ist during any point of epoch i, then the model used for the fossil finds process can be written 
as Di ~ Binomia^A'i, a). The distribution of iV = (A'^i, . . . , cannot be calculated explic- 



draw a value of 6 from its prior, simulate a sample tree and fossil finds, and then count the num- 
ber of simulated fossils in each epoch to find a simulated value of the data X. They then accepted 
6 if p{D, X) < 6 for some metric p(-, •) and tolerance 6. This gives an approximation to the pos- 
terior distribution of the parameter given the data and the model, where the approximation can 
be viewed as model or measurement error. 

However, instead of approximating the posterior, it is possible in theory to rewrite the model 
and perform the sampling analytically to find the exact posterior distribution: 



318 
319 
320 
321 
322 
323 
324 
325 

326 I. Draw (9 = (A,p,a) ~ 7r(-) 

327 2. Simulate a tree T using parameter A and count N 

328 3. Accept 9 with probabihty l\i=i ~ a)^''^\ 

330 This algorithm gives exact draws from the posterior distribution of 6 given D, and in theory 

33 1 there is no need for any assumption of measurement error. Note that 9 can include parameter a 

332 for the sampling rate, to be inferred along with the other model parameters. However, this makes 

333 finding a normalizing constant in step 3 difficult. Without a normalizing constant to increase the 

334 acceptance rate, applying this algorithm directly will be slow for many values of D and k (the 

335 choice of prior distribution and number of parameters we choose to include in 6 can also have 

336 a significant effect on the efficiency). A practical solution would be to add an error term and 
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337 assume the presence of measurement error on the data (which is Ukely to exist in this case), in 

338 order to increase the acceptance probability in step 3. Approaching the problem in this way, it is 

339 possible to carefully model the error on D and improve the estimate of the divergence time. 
340 

Using e to represent measurement error is straight forward, whereas using e to model the model 
discrepancy (to account for the fact the model is wrong) is harder to conceptualize and not as 
commonly used. For deterministic models, the idea of using a model error term when doing 
calibration or data as s imila tion is well established and described for a Bayesian framework in 
24^ Kennedy & Q'Hag^ (1200 1). They assume that the model run at its 'best' input, r/(^), is sufficient 

245 for the model when estimating 6. In other words, knowledge of the model run at its best input 

247 provides all the available information about the system for the purpose of prediction. If this is the 

24§ case, then we can define e to be the difference between ri{9) and D, and assume e is independent 

349 of ri{9). Note that the error is the difference between the data and the model run at its best input, 

350 and does not depend on 9. If we do not include an error term e, then the best input is the value of 

35 1 9 that best explains the data, given the model. When we include an error term which is carefully 

352 modelled and represents our beliefs about the discrepancy between ri{-) and reality, then it can be 

353 argued that 9 represents the 'true' value of 9, and that 7r(^ | D, e ~ 7re(-)) should be our posterior 

354 distribution for 9 in light of t he data and the model. 

355 For deterministic models, [Goldstein & Rougier (l2008h provide a framework to help think 



356 about the model discrepancy. To specify the distribution of e, it can help to break the discrep- 

357 ancy down into various parts: physical processes not modelled, errors in the specification of the 

358 model, imperfect implementation etc. So for example, if ri{-) represents a global climate model 

359 predicting average temperatures, then common model errors could be not including processes 

360 such as clouds, CO2 emissions from vegetation etc., error in the specification might be using an 

361 unduly simple model of economic activity, and imperfect implementation would include using 

362 grid cells too large to accurately solve the underlying differential equations. In some cases it 

363 may be necessary to consider model and measurement error, e + e say, and model each process 

364 separately. For stochastic models, as far as we are aware, no guidance exists about how to model 

365 the error, and indeed it is not clear what e should represent. 

366 To complicate matters further, for many models the dimension of D and X will be large, 

367 making it likely that the acceptance rate 7re(X — D) will be small. As noted above, in this case it 

368 is necessary to summarize the model output and the data using a multidimensional summary S{-). 

369 Using a summary means that rather than approximating t:{9 \ D), the algorithms approximate 

370 7r(^ I S{D)). The interpretation of e as model or measurement error still holds, but now the error 

371 is on the measurement S{D) or the model prediction S{X). If each element of S{-) has an 

372 interpretation in terms of a physical process, this may make it easier to break the error down into 

373 independent components. For example, suppose that we use S{x) = (x, Sxx), the sample mean 

374 and variance of X, and that we then use the following acceptance density 
375 
376 

377 This is equivalent to assuming that there are two sources of model error. Firstly, the mean predic- 

378 tion is assumed to be wrong, with the error distributed with density 7ri(-). Secondly, it assumes 

379 that the model prediction of the variance is wrong, with the error having distribution vr2(-). It 

380 also assumes that the error in the mean prediction is independent of the error in the variance 

381 prediction. This independence is not necessary, but helps with visualization and elicitation. For 

382 this reason it can be helpful to choose the different components of S{-) so that they are close 

383 to independent (independence may also help increase the acceptance rate). Another possibility 

384 for choosing S{-) is to use principal component analysis (if dim(X) is large) to find a smaller 



7r,(5(X) - S{D)) = 7ri(X - D)'K2{sxx - sdd)- 
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385 number of uncorrelated summaries of the data which may have me aningful interpretations. In 



386 general however, it is not known how to choose good summaries. I Joyce & MarjoramI (120081) 

387 have suggested a method for selecting between different summaries and for deciding how many 

388 summaries it is optimal to include. However, more work is required to find summaries which are 

389 informative, interpretable and for which we can describe the model error. 

390 Finally, once we have specified a distribution for e, we may find the acceptance rate is too small 

391 to be practicable and that it is necessary to compromise (as in Example |2] above). A pragmatic 

392 way to increase the acceptance rate is to use a more disperse distribution for e. This moves us 

393 from the realm of using e to model an error we believe exists, to using it to approximate the true 

394 posterior. This is currently how most ABC methods aie used. However, even when making a 

395 pragmatic compromise, the interpretation of the approximation in terms of an error will allow us 

396 to think more carefully about how to choose between different compromise solutions. 



Example 3. One of the first uses of an ABC algorithm was by lPritchard et al. d 19991) . who used 



a simple stochastic model to study the demographic history of the Y chromosome, and used an 
approximate rejection algorithm to infer mutation and demographic parameters for their model. 
Their data consisted of 445 Y chromosomes sampled at eight different loci from a mixture of 
populations from around the world, which they summarized by just three statistics: the mean 
(across loci) of the variance of repeat numbers V, the mean effective heterozygosity H, and 
the number of distinct haplotypes A^. The observed value of the summaries for their sample 
was D = {V, H, N)"^ = (1.149, 0.6358, 316)^. They elicited prior distributions for the mutation 
rates from the literature, and used diffuse priors for population parameters such as the growth rate 
and the effective number of ancestral Y chromosomes. Population growth was modelled using 
a standard coalescent model growing at an exponential rate from a constant ancestral level, and 
various different mutation models were used to simulate sample values for the three summaries 
measured in the data. They then applied Algorithm A using the metric 



All ^i^,^;=l 

413 

414 where X is a triplet of simulated values for the three summaries statistics. They used a toler- 

415 ance value of (5 = 0.1, which for their choice of metric corresponds to an error of 10% on each 

416 measurement. This gives results equivalent to assuming that there is independent uniform mea- 

417 surement error on the three data summaries, so that the true values of the three summaries have 

418 the following distributions 



y ~Zi[1.0341, 1.2624], ^ Z^[0.58122, 0.71038], iV ~ ^^[284, 348]. 



419 

420 

42 1 iBeaumont et all (|2002l ) used the same model and data set to compare the relative performance of 

422 Algorithm A with an algorithm similar to Algorithm B, using an Epanechnikov density applied 

423 to the metric value ^ for the acceptance probability '/re(-). They set a value of 5 (the cut-off in 

424 Algorithm A and the range of the support for e in Algorithm B) by using a quantile Ps of the 

425 empirical distribution function of simulated values of p{D,X), i.e.. Po ol means they accepted 

426 the 1% of model runs with values closest to D. They concluded that Algorithm B gives more 

427 accurate results than Algorithm A, meaning that the distribution found using Algorithm B is 

428 closer to the posterior found when assuming no measurement error (6 = 0). 

429 The conclusion that Algorithm B is preferable to Algorithm A for this model is perhaps not 

430 surprising in light of what we now know, as it was not taken into account that both algorithms 

431 used the same value of 6. For Algorithm A this corresponds to assuming a measurement er- 

432 ror with variance 6"^ /3, whereas using acceptance probability ^ is equivalent to assuming a 
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4. Approximate Markov chain Monte Carlo 

For problems which have a tightly constrained posterior distribution (relative to the prior), 
repeatedly drawing independent values of 9 from its prior distribution in the rejection algorithm 
can be inefficient. For problems with a high dimensional 9 this inefficiency is likely to make the 
application of a rejection type algorithm impracticable. The idea behind Markov chain Monte 
Carlo (MCMC) is to build a Markov chain on 9 and correlate successive observations so that 
more time is spent in regions of high posterior probability. Most MCMC algorithms, such as the 
Metropolis-Hastings al gorithm, depend on kno wledge of the likelihood function which we have 



assumed is not known. [Marjoram et al.l (120031) give an approximate version of the Metropolis- 
Hastings algorithm, which approximates the acceptance probability by using simulated model 
output with a metric and a 0-1 cut-off to approximate the likelihood ratio. This, as before, is 
equivalent to assuming uniform error on a set defined by the metric and the tolerance. As above, 
this algorithm can be generalized from assuming uniform measurement error to an arbitrary 
error term. Below, are two different algorithms to perform MCMC for the model described by 
Equation ([T]). The difference between the two algorithms lies in the assumption about on which 
space we choose to construct the stationary distribution. In Algorithm C we consider the state 
variable to belong to the space of parameter values 0, and construct a Markov chain {^i, ^2, • • •} 
which obeys the following dynamics: 



433 measurement error with variance 5^/5. Therefore, using Algorithm B uses measurement error 

434 only 60% as variable as that assumed in Algorithm A, and so it is perhaps not surprising that 

435 Algorithm B gives more accurate results in this case. 
436 

437 
438 
439 
440 
441 
442 
443 
444 
445 
446 
447 
448 
449 
450 
451 
452 
453 
454 
455 
456 
457 

Algorithm C: probabilistic approximate MCMC 1 

-^q CI At time t, propose a move from 9t to 9' according to transition kernel q{9t, 9'). 
C2 Simulate X' - r]{9'). 
C3 Set 9t+i = 9' with probabihty 

462 7T,{D-X') . / q{9',9t)7:i9') \ 

464 , ■ n n 

.^r otherwise set fe'^+i = 9t. 

465 

466 

An alternative approach is to introduce the value of the simulated output as an auxiliary variable 
and construct the Markov chain on the space G x ^, where X is the space of model outputs. 

469 

470 Algorithm D: probabilistic approximate MCMC 2 

472 Dl At time t, propose a move from il)t = {9t, Xt) to tj/ = {9', X') with 9' drawn from transition 

472 kernel q{9t, 9'), and X' simulated from the model using 9': 

473 X' ~ 7^(9') 
414 

475 D2 SetV't+i = {9',X') with probability 

477 r{{9t,Xt),{9 ,X)) = mm I 1,—- , (5) 

' ' V TTeiD- Xt)q{9t,9')7r{9t)J 

4/0 

479 otherwise set ipt+i = ipt- 

480 
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481 Proof of convergence. To show that these Markov chains converge to the required posterior 

482 distribution, it is sufficient to show that the chains satisfy the detailed balance equations 
483 
484 



p{s, t)iT{t) = p{t, s)tt{s) for all s, t 



485 where •) is the transition kernel of the chain and 7r(-) the required stationary distribution. 

486 For Algorithm C the transition kernel is the product of q{9, 9') and the acceptance rate. To 

487 calculate the acceptance rate, note that in Equation (lH) the acceptance probability is conditioned 

488 upon knowledge of X' and so we must integrate out X' to find r{9, 9'). This gives the transition 

489 kernel for the chain: 

Z PC. «') - / ^^^^ -in (l. 4^^) AX' I .)dX'. 

Jx c V q{9t,9')7r{9t)J 

The target stationary distribution is 

494 
495 
496 
497 
498 
499 



tt{D\9)= [ TT^{D - X)tt{X \9)dX. 
Jx 



It is then simple to show that the Markov chain described by Algorithm C satisfies the detailed 
balance equations. 

For Algorithm D, the transition kernel is 

500 ,n n'.^r^nnr.^- f 7T,{D - X')q{9' , 9t)n{9') 

501 
502 
503 
504 



The Markov chain in this case takes values on in x ^ and the required stationary distribution 
is 

505 .(^.X|Z)) = '-'°--^'"-fl'"'"'> (7) 

506 T'i.D) 

507 It can then be shown that Equations Q and <^ satisfy the detailed balance equations. □ 
508 

509 

510 While Algorit hm C is more recognis able as a generalization of the approximate MCMC algo- 



511 rithm given in Marjor am et all (120031) . Algorithm D is likely to be more efficient in most cases. 

512 This is because the ratio of model error densities that occurs in acceptance rate Q is likely to 

513 result in larger probabilities than those given by Equation (01) which simply has a. TTe{D — x)/c 

514 term instead. Algorithm D also has the advantage of not requiring a normalizing constant. 
515 

516 
517 

518 5. Extensions 

519 5-1. Monte Carlo integration 

520 Suppose our aim is to calculate expectations of the form 
521 

522 E{f{9) \D)= f{9)7T{9 \ D)d9 

523 

524 where the expectation is taken with respect to the posterior distribution of 9. The simplest way 

525 to approach this is to draw a sample of 9 values {^i}j=i,...,ra, from 7r{9 \ D) using Algorithm B, 

526 C or D and then approximate using the sum n"^ ^ /(^i)- However, a more stable estimate can 

527 be obtained by using draws from the prior weighted by -^^{0 — Xj) as in Algorithm B. For each 

528 9 drawn from the prior in step Bl, assign it weight Wi = ■k{D — Xi). Then an estimator of the 
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529 required expectation is 

«^ E/(«>. 



532 

533 Note that all values of 9 drawn from the prior are used in the sum; the re is no rejection step. This 



534 is a direct extension of the estimate given in lBeaumont et all (|2002h which used Epanechnikov 

535 kernels to weight each value of 6. 
536 

537 

538 5-2. Approximate model selection 

539 The theoretical acceptance rate from the rejection algorithm (Algorithm A with (5 = 0) is equal 

540 to the model evidence vr(D). The evidence from different models can then be used to calculate 



541 Bayes factors which can be used to perform model selection (iKass & Rafteryl (Il995h ). It is possi- 

542 ble to approximate the value of vr(D) by using the acceptance rate from Algorithm A. By doing 

543 this for two or more competing models, w e can perform app roximate model selection, although 

544 in practice this approach can be unstable ((Wilkinsonl (12007|)). The estimate of tt{D) can be im- 



545 proved and made interpretable by using the weighted estimate 
546 

548 ^^ri^p^ 

549 ^ 

where X} , . . . , X^^ ~ vi^i) ^^id 9i, . . . ,9n ~ vr(-). This gives a more stable estimate than sim- 
ply taking the acceptance rate, and also tends to the exact value (as n, m — oo) for the model 

552 given by Equation ([T]). 

553 
554 
555 

555 6. Discussion 

557 It has been shown in this paper that approximate Bayesian computation algorithm can be 

558 considered to give exact inference under the assumption of model error. However, this is only 

559 part of the way towards a complete understanding of ABC algorithms. In the majority of the 

560 application papers using ABC methods, summaries of the data and model output have been used 

56 1 to reduce the dimension of the output. It cannot be known whether these summaries are sufficient 

562 for the data, and so in most cases the use of summaries means that there is another layer of 

563 approximation. While this work allows us to understand the error assumed on the measurement 

564 of the summary, it says nothing about what effect using the summary rather than the complete 

565 data has on the inference. More work is required to discover this, and to help guide the choice of 

566 which summaries to use. 

567 The use of a model error term when making inferences is important if one wants to move 

568 from making statements about the model to statements about reality. There has currently been 

569 only minimal work done on modelling the discrepancy term for stochastic models. One way to 

570 approach this is to view the model as deterministic, outputting a density 7:e{x) for each value of 

571 the input 9 (many realizations of r]{9) would be needed to learn iTg{x)). The discrepancy term 

572 e can then be considered as representing the difference between 7r0{x) and the true variability 

573 inherent in the physical system (at least the variability given the level we choose to model it 

574 at). Finally, it should be possible to generalize approximate sequential Monte Carlo methods in 

575 a similar way to that done for the approximate rejection and approximate Markov chain Monte 

576 Carlo algorithms. 



Approximate Bayesian computation (ABC) gives exact results under the assumption of model errorX 3 



577 Acknowledgements 

578 I would like to thank Professor Paul Blackwell for his suggestion that the metric in the ABC 

579 algorithm might have a probabilistic interpretation. I would also like to thank Professor Simon 

580 Tavare, Professor Tony O'Hagan, and Dr Oakley for useful discussions on an early draft of the 

581 manuscript. 
582 

583 

584 References 

585 Beaumont, M. A., Cornuet, J. M., Marin, J. M. & Robert, C. R (2008). Adaptivity for ABC algorithms: the 

586 ABC-PMC scheme. In submission . 

co-j Beaumont, M. A., Zhang, W. & Balding, D. J. (2002). Approximate Bayesian computatation in population 

^ genetics. Genetics 162, 2025-2035. 

588 Blum, M. G. B. & Francois, O. (2008). Highly tolerant likelihood-free Bayesian inference: an adaptive non-linear 

589 heteroscedastic model. In submission, available as arXiv:0809.4178vl . 

Blum, M. G. B. & Tran, V. C. (2008). Approximate Bayesian computation for epidemiological models: applica- 
^ tion to the Cuban HIV- AIDS epidemic with contact-tracing and unobserved infectious population. In submission, 

available as arXiv: 08 10.0896V 1 . 

592 Campbell, K. (2006). Statistical calibration of computer simulations. Reliab. Eng. Syst. Safe. 91, 1358-1363. 

593 CoRNUET, J. M., Santos, F., Beaumont, M. A., Robert, C. R, Marin, J. M., Balding, D. J., Guille- 

MAUD, T. & Estoup, a. (2008). Inferring population history with DIYABC: a user-friendly approach to approx- 
imate Bayesian computation. Bioinformatics , to appear. 

595 Diggle, p. J. & Gratton, R. J. (1984). Monte Carlo methods of inference for implicit statistical models. J. R. 

596 Statist. Soc. B 46, 193-227. 

^g-j FOLL, M., Beaumont, M. A. & Gaggiotti, O. (2008). An approximate Bayesian computation approach to over- 

come biases that arise when using amplified fragment length polymorphism markers to study population structure. 

598 Genetics 179, 927-939. 

599 Goldstein, M. & Rougier, J. C. (2008). Reified bayesian modelling and inference for physical systems (with 
gQQ discussion). / Stat. Plan. Infer. , to appear. 

Hamilton, G., Currat, M., Ray, N., Heckel, G., Beaumont, M. A. & Excoffier, L. (2005). Bayesian 
estimation of recent migration rates after a spatial expansion. Genetics 170, 409^17. 
602 HiGDON, D., Gattiker, J., Williams, B. & Rightley, M. (2008). Computer model caUbration using high- 

503 dimensional output. J. Am. Statis. Assoc. 103, 570-583. 

. Joyce, R & Marjoram, R (2008). Approximately sufficient statistics and Bayesian computation. Stat. Appl. 

Genet. Mo. B. 7, article 26. 

605 Kass, R. E. & Raftery, A. E. (1995). Bayes factors. J. Am. Statis. Assoc. 90, 773-795. 

606 Kennedy, M. & O'Hagan, A. (2001). Bayesian caUbration of computer models (with discussion). J. R. Statist. 
Soc. B 63, 425^64. 

Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer 
Marjoram, P., Molitor, J., Plagnol, V. & Tavare, S . (2003). Markov chain Monte Carlo without Ukelihoods. 
609 Proc. Natl. Acad. Sci. USA 100, 15324-15328. 

Murray, I., Ghahramani, Z. & MacKay, D. J. C. (2006). MCMC for doubly-intractable distributions. In 

,^ ^ Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence. UAI. 

Plagnol, V. & Tavare, S. (2004). Approximate Bayesian computation and MCMC. In Proceedings of Monte 
612 Carlo and Quasi-Monte Carlo Methods 2002, H. Niederreiter, ed. Springer- Verlag. 

523 Pritchard, J. K.. Seielstad, M. T., Perez-Lezaun, a. & Feldman, M. W. (1999). Population growth of 

human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 16, 1791-1798. 
Ratmann, O., Jorgensen, O., Hinkley, T., Stumpf, M., Richardson, S. & Wiuf, C. (2007). Using 

615 Ukelihood-free inference to compare evolutionary dynamics of the protein networks of H. pylori and P. falciparum. 

616 PLos Comput. Biol. 3, 2266-2276. 

Siegmund, K. D., Marjoram, P. & Shibata, D. (2008). Modeling DNA methylation in a population of cancer 
cells. Stat. Appl. Genet. Mo. B. 7, article 18. 

618 SissoN, S. A., Fan, Y. & Tanaka, M. M. (2007). Sequential Monte Carlo without Ukelihoods. Proc. Natl. Acad. 

619 Sci. USA 104, 1760-1765. 

520 Tanaka, M. M., Francis, A. R., Luciani, F. & Sisson, S. A. (2006). Using approximate Bayesian computation 

,^ to estimate tuberculosis transmission parameters from genotype data. Genetics 173, 15 1 1-1520. 

Wilkinson, R. D. (2007). Bayesian inference of primate divergence times. Ph.D. thesis. University of Cambridge. 

622 
623 
624 



607 
608 



